The Failure Trace Archive: Enabling the comparison of failure measurements and models of distributed systems

Authors

  • Bahman Javadi
  • Derrick Kondo
  • Alexandru Iosup
  • Dick H. J. Epema
Abstract

With the increasing presence, scale, and complexity of distributed systems, resource failures are becoming an important and practical topic of computer science research. While numerous failure models and failure-aware algorithms exist, their comparison has been hampered by the lack of public failure data sets and data processing tools. To facilitate the design, validation, and comparison of fault-tolerant models and algorithms, we have created the Failure Trace Archive (FTA), an online, public repository of failure traces collected from diverse parallel and distributed systems. In this work, we first describe the design of the archive, in particular of the standard FTA data format, and the design of a toolbox that facilitates automated analysis of trace data sets. We also discuss the use of the FTA for various current and future purposes. Second, after applying the toolbox to nine failure traces collected from distributed systems used in various application domains (e.g., HPC, Internet operation, and various online applications), we present a comparative analysis of failures in these systems. Our analysis presents statistical insights and typical statistical modeling results for the availability of individual resources. The analysis results underline the need for public availability of trace data from different distributed systems. Last, we show how different interpretations of the meaning of failure data can result in different conclusions for failure modeling and job scheduling in distributed systems. Our results for different interpretations show evidence that existing failure-aware algorithms may need to be revisited when they are applied to general rather than domain-specific distributed systems.

Corresponding authors: Bahman Javadi (Telephone: +61-2-9685 9181; Fax: +61-2-9685 9245) and Alexandru Iosup (Telephone: +31-15-2784433; Fax: +31-15-2786632).
Preprint submitted to Journal of Parallel and Distributed Computing, March 28, 2013.

Similar resources

A Distributed Authentication Model for an E-Health Network Using Blockchain

Introduction: One of the most important and challenging areas under the influence of information technology is the field of health. This pervasive influence has led to the development of electronic health (e-health) networks with a variety of services of different qualities. The issue of security management, maintaining confidentiality and data integrity, and exchanging it in a secure environme...

A Model for Space-Correlated Failures in Large-Scale Distributed Systems

Distributed systems such as grids, peer-to-peer systems, and even Internet DNS servers have grown significantly in size and complexity in the last decade. This rapid growth has allowed distributed systems to serve a large and increasing number of users, but has also made resource and system failures inevitable. Moreover, perhaps as a result of system complexity, in distributed systems a single ...

EFFECT OF HEMODIALYSIS ON TRACE ELEMENTS IN PATIENTS WITH ACUTE AND CHRONIC RENAL FAILURE

Hemodialysis is being implicated in the development of metabolic disturbances, as complications have been observed and the role of trace metals in their development has been questioned. In 78 renal failure patients who underwent hemodialysis, serum levels of zinc and copper were determined before and after first hemodialysis. Acute and chronic renal failure patients were found to have lowe...

Numerical Modeling of Rock Slopes with a Potential of Block-Flexural Toppling Failure

One of the most important instabilities of rock slopes is toppling failure. Among the types of toppling failure, block-flexural failure is the more common instability in nature. In this failure, some rock blocks break because of tensile stresses, some overturn under their own weight, and then all of them topple together. Physical and theoretical modeling of this failure is studie...

Journal:
  • J. Parallel Distrib. Comput.

Volume 73  Issue 

Pages  -

Publication date 2013